Introduction to Triton Programming: The Performance Paradox: Why Correct Code Can Still Be Slow

The Performance Paradox states that a mathematically perfect kernel, such as $out = x + y$, can perform worse than a CPU loop if it fails to absorb the fixed costs of the GPU hardware. This cost most often shows up as the Launch Tax.

1. The "Correctness" Fallacy

Functional correctness is no substitute for efficiency. Even if your Triton code correctly distributes work across thousands of threads, the GPU stays underutilized when the total amount of work (N) is small. The hardware spends more time in state transitions than in actual computation.
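To make the underutilization concrete, the following minimal sketch counts how many program instances a 1D launch would create and what fraction of the GPU's streaming multiprocessors (SMs) could receive work at all. The helper names `grid_size` and `sm_utilization`, the block size of 1024, and the SM count of 108 are illustrative assumptions, not values taken from this lesson.

```python
def grid_size(n_elements: int, block_size: int) -> int:
    """Number of program instances a 1D launch needs (ceiling division)."""
    return (n_elements + block_size - 1) // block_size


def sm_utilization(n_elements: int, block_size: int, num_sms: int) -> float:
    """Upper bound on the fraction of SMs that receive at least one program."""
    return min(grid_size(n_elements, block_size), num_sms) / num_sms


# Assumed, illustrative values (hypothetical, not from the lesson).
BLOCK_SIZE = 1024
NUM_SMS = 108

for n in (256, 100_000_000):
    programs = grid_size(n, BLOCK_SIZE)
    share = sm_utilization(n, BLOCK_SIZE, NUM_SMS)
    print(f"N={n}: {programs} program(s), at most {share:.1%} of SMs busy")
```

Under these assumptions, N=256 fits in a single program instance, so at most one SM has anything to do while the rest sit idle for the whole launch.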

2. The Python Measurement Trap

Measuring GPU code from Python with time.time() is dangerous. GPU calls are asynchronous: Python only enqueues the command and moves on. Without torch.cuda.synchronize(), you measure the time it takes to enqueue the work, not the time it takes to run it. With synchronization, you measure the full host-to-device round trip, and for small kernels that latency is often about 10 times longer than the kernel execution itself.
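The difference between these measurements can be sketched as follows, assuming PyTorch and a CUDA device are available. The helper names `time_wall_clock` and `time_cuda_events`, the warm-up count, and the iteration count are hypothetical choices for illustration. The import guard only keeps the file loadable on machines without PyTorch or a GPU.

```python
import time

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:  # keep the sketch importable without PyTorch
    torch = None
    HAS_CUDA = False


def time_wall_clock(fn, sync: bool) -> float:
    """Seconds for one call of fn, as seen from Python."""
    start = time.perf_counter()
    fn()
    if sync:
        torch.cuda.synchronize()  # block until the GPU has finished
    return time.perf_counter() - start


def time_cuda_events(fn, warmup: int = 10, iters: int = 100) -> float:
    """Mean milliseconds per call, measured on the GPU timeline."""
    for _ in range(warmup):  # warm-up absorbs one-time setup costs
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait so that elapsed_time() is valid
    return start.elapsed_time(end) / iters


if HAS_CUDA:
    x = torch.rand(256, device="cuda")
    y = torch.rand(256, device="cuda")
    add = lambda: x + y
    add()
    torch.cuda.synchronize()  # start from an empty queue
    print("no sync (s):", time_wall_clock(add, sync=False))
    torch.cuda.synchronize()
    print("sync    (s):", time_wall_clock(add, sync=True))
    print("events (ms):", time_cuda_events(add))
```

Averaging over many iterations after a warm-up phase is a common way to reduce noise. No expected numbers are shown because they depend entirely on the hardware and driver.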

3. Latency vs. Throughput

To resolve the paradox, you must supply enough work to "hide" the launch latency. This is the transition from a latency-bound regime, limited by the CPU-GPU bus, to a throughput-bound regime, limited by GPU memory or compute.
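A simple cost model illustrates where that transition happens for $out = x + y$. The sketch below treats the kernel time as a fixed launch cost plus the time to move the data, and assumes float32 elements (two loads and one store, so 12 bytes per element). The launch latency of 10 microseconds and the bandwidth of 1 TB/s are made-up round numbers for illustration, and the function names are hypothetical.

```python
def kernel_time_s(n: int, launch_s: float, bytes_per_elem: int,
                  bandwidth_bps: float) -> float:
    """Modelled time: fixed launch cost plus memory traffic time."""
    return launch_s + n * bytes_per_elem / bandwidth_bps


def launch_fraction(n: int, launch_s: float, bytes_per_elem: int,
                    bandwidth_bps: float) -> float:
    """Share of the modelled time that is pure launch overhead."""
    return launch_s / kernel_time_s(n, launch_s, bytes_per_elem, bandwidth_bps)


def break_even_n(launch_s: float, bytes_per_elem: int,
                 bandwidth_bps: float) -> float:
    """N at which launch cost equals data-movement time."""
    return launch_s * bandwidth_bps / bytes_per_elem


# Assumed, illustrative hardware numbers (not measurements).
LAUNCH_S = 10e-6        # 10 microseconds per launch
BYTES_PER_ELEM = 12     # float32: 2 loads + 1 store
BANDWIDTH_BPS = 1e12    # 1 TB/s

for n in (256, 100_000_000):
    share = launch_fraction(n, LAUNCH_S, BYTES_PER_ELEM, BANDWIDTH_BPS)
    print(f"N={n}: launch overhead is {share:.2%} of modelled time")
print("break-even N:", break_even_n(LAUNCH_S, BYTES_PER_ELEM, BANDWIDTH_BPS))
```

Under these assumptions, the launch cost dominates almost entirely at N=256 and becomes a sub-percent effect at N=10^8, with the crossover somewhat below one million elements. Real hardware will shift these numbers, but the shape of the trade-off is the same.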

QUESTION 1

For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).

N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch

N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic

N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch

All are compute-bound.

QUESTION 2

In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?

Arithmetic Throughput

Memory Bandwidth

L1 Cache Size

QUESTION 3

What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?

The GPU and CPU always finish at the same time.

The CPU continues to the next line of code before the GPU kernel finishes.

The kernel runs faster on smaller GPUs.

Memory transfers are blocked by compute.

QUESTION 4

Why does $out = x + y$ exhibit low arithmetic intensity?

It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.

The addition operation is too complex for the ALUs.

It requires shared memory synchronization.

It only runs on one SM.

QUESTION 5

How can the 'Launch Tax' be amortized in a real-world application?

By calling the kernel more frequently with smaller data.

By increasing the workload per launch (e.g., larger N or batching).

By using 16-bit floats instead of 32-bit floats.

By disabling the L2 cache.